We are going to investigate and analyze a movies dataset. The dataset contains a set of movies and information about each one: its popularity, production budget, revenue, cast, genres, and its rating together with the number of voters.
We will first clean the dataset and prepare it for analysis. Then we will apply exploratory data analysis to it and answer some questions about the data so that we can extract useful information from it.
Finally, we will try to identify correlations between features and determine which variables are independent and which depend on them, so that we can prepare a case for prediction and for building an ML model.
In this report we will investigate the data and try to answer these questions:
What is the impact of each factor on revenue from 2006 to 2015?
What are the top 10 movies in terms of movie count, revenue, budget, and popularity, and what is the impact of these factors on revenue?
Who are the top 10 actors in terms of revenue, budget, and popularity, and what is the impact of these factors on revenue?
What is the impact of genres on revenue and budget?
What is the impact of production companies on revenue and budget?
# Loading Necessary Packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
%matplotlib inline
# Loading Data
df = pd.read_csv('drive/My Drive/tmdb-movies.csv')
df.info()
The previous output shows missing data in the cast, homepage, director, tagline, keywords, overview, genres, and production_companies columns. Some of these columns are not important to our analysis. The homepage column, which has the most missing data, is one of them, so we will drop it along with overview, keywords, tagline, and imdb_id, none of which will be useful in our analysis.
# drop ['imdb_id','homepage','tagline','keywords', 'overview'] columns from the dataframe
df.drop(['imdb_id','homepage','tagline','keywords', 'overview'], axis=1, inplace=True)
df.info()
There is still missing data in several columns that we do need for the analysis. These are text columns such as cast, genres, and production companies, and their missing values cannot be filled in with substitute data without introducing false information. We will therefore drop the rows with missing values and work only with the movies whose data is complete, 9773 movies in total.
# dropping rows with null values
df.dropna(inplace=True)
df.info()
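As a complement to `df.info()`, the effect of `dropna` can be checked by counting missing values per column before dropping. The following is a minimal sketch on an invented toy table (the column values are made up and only mimic the shape of the movies data):

```python
import pandas as pd

# Toy stand-in for the movies table; all values are invented for illustration.
toy = pd.DataFrame({
    "original_title": ["A", "B", "C", "D"],
    "cast": ["x|y", None, "z", "w"],
    "genres": ["Drama", "Action", None, "Comedy"],
    "revenue": [10, 20, 30, 40],
})

# Number of missing values in each column
missing_per_column = toy.isnull().sum()

# Keep only the rows where every column is present
complete = toy.dropna().reset_index(drop=True)

print(missing_per_column)
print(len(toy), "rows before,", len(complete), "rows after")
```

Comparing the row counts before and after makes it easy to see how much data the drop costs.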
Now that the missing data is handled, we turn to the data types: each column should be stored with the type that matches its meaning. The id and release year columns are stored as integers, but arithmetic on them makes no sense, so we will convert both to strings. The release date is stored as an object (string) column while it should be a datetime, so we will convert it as well.
# convert id column from int to string
df['id'] = df['id'].astype(str)
# convert release year from int to string
df['release_year'] = df['release_year'].astype(str)
# convert release date from string to datetime data type
df['release_date'] = pd.to_datetime(df['release_date'])
df.info()
We also have three columns that contain lists of names or genres separated by '|', so we will convert them to lists.
# split each value in cast, genres, and production companies on the '|' delimiter
# and convert the columns to lists
df['cast'] = df['cast'].str.split('|')
df['genres'] = df['genres'].str.split('|')
df['production_companies'] = df['production_companies'].str.split('|')
Now let's look at the data itself so we can recognize outliers and values that don't make sense.
# show summary statistics for the numeric values in the dataset
df.describe()
The previous output shows a few points that need attention.
The range of popularity looks inconsistent: 75% of the values lie between 0.000188 and 0.776380, which suggests a scale of roughly 0 to 1, yet the maximum is above 32.
The runtime column has a minimum of 0, which cannot be right for a movie, and a maximum of 877 minutes, which is more than 14 hours and equally implausible.
These two observations indicate that the dataset contains outliers that need to be removed.
# removing rows with outliers from the dataset
df = df[(np.abs(stats.zscore(df[['popularity','runtime']])) < 3).all(axis=1)].reset_index(drop=True)
df.describe()
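The z-score rule used above keeps a row only if every selected column lies within 3 standard deviations of its column mean. The sketch below reproduces that rule with plain NumPy on invented data; the helper name `drop_zscore_outliers` is hypothetical, and the population standard deviation (`ddof=0`) is used because that is the default of `scipy.stats.zscore`:

```python
import numpy as np
import pandas as pd

def drop_zscore_outliers(frame, columns, threshold=3.0):
    """Keep rows whose z-score is below `threshold` in all given columns."""
    values = frame[columns].to_numpy(dtype=float)
    # Column-wise z-scores with the population standard deviation (ddof=0)
    z = (values - values.mean(axis=0)) / values.std(axis=0)
    keep = (np.abs(z) < threshold).all(axis=1)
    return frame[keep].reset_index(drop=True)

# Invented data: 19 ordinary rows plus one row with an extreme popularity
toy = pd.DataFrame({
    "popularity": [0.5] * 19 + [30.0],
    "runtime": [90 + i for i in range(19)] + [100],
})

cleaned = drop_zscore_outliers(toy, ["popularity", "runtime"])
print(len(toy), "rows before,", len(cleaned), "rows after")
```

Note that on very small samples no value can reach a z-score of 3, so the rule only becomes meaningful with enough rows.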
Some rows also have a revenue of 0 or a budget of 0, which we can treat as missing data. The filter below drops the rows with zero revenue so that the data is clean and usable for testing our hypotheses; it does not filter on budget, so rows with zero budget but nonzero revenue remain.
df = df[df['revenue'] != 0]
df.describe()
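If zero budgets should be treated as missing in the same way as zero revenues, the condition can be extended to cover both columns. This is a sketch of that variant on invented numbers, not the filter applied in this report:

```python
import pandas as pd

# Invented rows covering the four combinations of zero / nonzero values
toy = pd.DataFrame({
    "budget": [0, 100, 50, 0],
    "revenue": [10, 0, 80, 0],
})

# A row is kept only when both revenue and budget are nonzero
has_financials = (toy["revenue"] != 0) & (toy["budget"] != 0)
toy_financials = toy[has_financials].reset_index(drop=True)

print(toy_financials)
```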
To prepare the dataset for analysis, we expand the dataframe so that each value in the cast, genres, and production companies lists gets its own row.
# expand each list column so we can analyze the data per individual value in the lists
# each expanded column is saved in its own new dataframe
df_cast_expanded = df.explode('cast')
df_genres_expanded = df.explode('genres')
df_production_companies_expanded = df.explode('production_companies')
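To make the split-then-explode step concrete, here is a small sketch on two invented movies. It also shows a side effect worth keeping in mind: after exploding, a movie appears once per list value, so totals computed on the expanded frame count that movie several times.

```python
import pandas as pd

# Two invented movies; the first one belongs to two genres
toy = pd.DataFrame({
    "original_title": ["Movie A", "Movie B"],
    "genres": ["Action|Drama", "Comedy"],
    "revenue": [100, 40],
})

# Turn the '|'-separated string into a list, then give each value its own row
toy["genres"] = toy["genres"].str.split("|")
expanded = toy.explode("genres")

print(expanded)
print("revenue total before:", toy["revenue"].sum())
print("revenue total after: ", expanded["revenue"].sum())
```

For this reason the per-genre, per-actor, and per-company comparisons are best read as averages per group rather than as shares of a grand total.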
Now that the data is trimmed and cleaned, we can move on to exploration: computing statistics and creating visualizations that address the research questions.
# Showing the size of the dataset and the range of years this dataset fall in.
print("Dataset Size: {}, From year {}, to year {}".format(len(df),df['release_year'].min(),df['release_year'].max()))
Number of Movies, Revenue, and Budget for Latest 10 years in the dataset from 2006 to 2015
# calculating the count of movies in each year in the last 10 years of the dataset
top_10_year_count = df[['release_year','id']].groupby('release_year').count().sort_values('release_year',ascending=False).head(10)
# calculating the mean of revenue in each year in the last 10 years of the dataset
top_10_year_revenue = df[['release_year','revenue']].groupby('release_year').mean().sort_values('release_year',ascending=False).head(10)
# calculating the mean of budget in each year in the last 10 years of the dataset
top_10_year_budget = df[['release_year','budget']].groupby('release_year').mean().sort_values('release_year',ascending=False).head(10)
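The three yearly tables above can also be computed in one pass with named aggregation. This is an optional sketch on invented rows; the output column names `movies`, `mean_revenue`, and `mean_budget` are assumptions chosen for illustration:

```python
import pandas as pd

# Invented rows; release_year is a string, as in the cleaned dataset
toy = pd.DataFrame({
    "id": ["1", "2", "3", "4", "5"],
    "release_year": ["2014", "2014", "2015", "2015", "2015"],
    "revenue": [100.0, 300.0, 50.0, 150.0, 100.0],
    "budget": [40.0, 60.0, 30.0, 30.0, 60.0],
})

yearly = (
    toy.groupby("release_year")
       .agg(movies=("id", "count"),
            mean_revenue=("revenue", "mean"),
            mean_budget=("budget", "mean"))
       .sort_index(ascending=False)   # most recent year first
       .head(10)                      # keep the latest 10 years
)

print(yearly)
```

Keeping the count next to the averages makes it easier to see when a yearly mean is based on only a few movies.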
Among these 10 years, the year with the most movies is 2011 and the year with the fewest is 2015.
# plot a barplot of the number of movies per year for the last 10 years
sns.set_style('whitegrid')
sns.set_context({"figure.figsize": (20, 10)})
sns.set(font_scale=2) # scale of font size of the plot
sns.barplot(x = top_10_year_count.index ,y= top_10_year_count['id'] )
plt.ylabel("Number of Movies")
plt.xlabel("")
plt.title("Number of Movies Per Year From (2006 - 2015)")
plt.xticks(rotation=90);
The average revenue differs from year to year: the highest average was in 2007 and the lowest in 2015. Average revenue was fairly stable from 2007 to 2012 and started to decrease after 2012.
# plot a barplot of the average revenue per year for the last 10 years
sns.set_style('whitegrid')
sns.set_context({"figure.figsize": (20, 10)})
sns.set(font_scale=2) # scale of font size of the plot
sns.barplot(x = top_10_year_revenue.index ,y= top_10_year_revenue['revenue'] )
plt.ylabel("Average Revenue")
plt.xlabel("")
plt.title("Average Revenue Per Year From (2006 - 2015)")
plt.xticks(rotation=90);
The decrease in average budget is noticeable and is consistent with the decrease in revenue. One plausible explanation is that when movie makers cut their budgets, production quality suffers and revenue falls as a result, although the plots alone do not prove this.
# plot a barplot of the average budget per year for the last 10 years
sns.set_style('whitegrid')
sns.set_context({"figure.figsize": (20, 10)})
sns.set(font_scale=2) # scale of font size of the plot
sns.barplot(x = top_10_year_budget.index ,y= top_10_year_budget['budget'] )
plt.ylabel("Average Budget")
plt.xlabel("")
plt.title("Average Budget Per Year From (2006 - 2015)")
plt.xticks(rotation=90)
Top 10 Movies in terms of 'revenue', 'Budget', and 'Rating'
Transformers: Dark of the Moon made the highest revenue in the dataset. These are the top 10 movies in terms of revenue; next we will look at some other attributes of these 10 movies.
# plot a barplot of the top 10 movies in terms of revenue
# numeric_only=True restricts the mean to numeric columns and avoids a TypeError on text and list columns in pandas 2.x
top_10_movies_revenue = df.groupby('original_title').mean(numeric_only=True).sort_values('revenue',ascending=False).head(10).reset_index()
# plot properties
sns.set_style('whitegrid')
sns.set_context({"figure.figsize": (20, 10)})
sns.set(font_scale=2) # scale of font size of the plot
g = sns.barplot(x='revenue',y='original_title', data=top_10_movies_revenue )
# setting text on bars
for index, row in top_10_movies_revenue.iterrows():
    g.text(row.revenue+10000000,row.name, round(row.revenue,2), color='black', ha="center")
plt.ylabel("")
plt.xlabel("Revenue")
plt.title("Revenue Per Movie For The top 10 Highest Movies In Term of Revenue From (1960 - 2015)");
Budgets vary widely among these movies, and a movie can have a lower budget than others and still reach the highest revenue. This leads us to a hypothesis: an extremely high budget does not guarantee higher revenue and may even coincide with lower revenue, while an extremely low budget is also likely to hurt revenue through poor production quality.
# plot a barplot of the budget for the top 10 movies in terms of revenue
top_10_movies_revenue = df.groupby('original_title').mean(numeric_only=True).sort_values('revenue',ascending=False).head(10).reset_index()
# plot properties
sns.set_style('whitegrid')
sns.set_context({"figure.figsize": (20, 10)})
sns.set(font_scale=2) # scale of font size of the plot
g = sns.barplot(x='budget',y='original_title', data=top_10_movies_revenue )
# setting text on bars
for index, row in top_10_movies_revenue.iterrows():
    g.text(row.budget+10000000,row.name, round(row.budget,2), color='black', ha="center")
plt.ylabel("")
plt.xlabel("Budget")
plt.title("Budget Per Movie For The top 10 Highest Movies In Term of Revenue From (1960 - 2015)");
The highest-rated movies did not earn very high revenue, which suggests that rating alone does not determine whether a movie is successful in terms of revenue.
# plot a barplot of the revenue for the top 10 movies in terms of rating
top_10_movies_rating = df.groupby('original_title').mean(numeric_only=True).sort_values('vote_average',ascending=False).head(10).reset_index()
# plot properties
sns.set_style('whitegrid')
sns.set_context({"figure.figsize": (20, 10)})
sns.set(font_scale=2) # scale of font size of the plot
g = sns.barplot(x='revenue',y='original_title', data=top_10_movies_rating )
# setting text on bars
for index, row in top_10_movies_rating.iterrows():
    g.text(row.revenue+29999999,row.name, 'Revenue = '+str(round(row.revenue,2))+', Rating = '+str(round(row.vote_average,2)), color='black', ha="center")
plt.ylabel("")
plt.xlabel("Revenue")
plt.title("Revenue Per Movie For The top 10 Highest Movies In Term of Rating From (1960 - 2015)");
From this finding, rating is not a factor that determines revenue on its own: among the top 10 rated movies, some earn high revenue and others do not.
This visualization suggests that the budget has to be balanced to get the best revenue. A movie needs enough budget to be well produced, but among the highest-budget movies an extremely high budget does not translate into the highest revenue.
# plot a barplot of the revenue for the top 10 movies in terms of budget
top_10_movies_budget = df.groupby('original_title').mean(numeric_only=True).sort_values('budget',ascending=False).head(10).reset_index()
# plot properties
sns.set_style('whitegrid')
sns.set_context({"figure.figsize": (24, 10)})
sns.set(font_scale=2) # scale of font size of the plot
g = sns.barplot(x='revenue',y='original_title', data=top_10_movies_budget )
# setting text on bars
for index, row in top_10_movies_budget.iterrows():
    g.text(row.revenue+199990000,row.name, 'Revenue = '+str(round(row.revenue,2))+', Budget = '+str(round(row.budget,2)), color='black', ha="center")
plt.ylabel("")
plt.xlabel("Revenue")
plt.title("Revenue Per Movie For The top 10 Highest Movies In Term of Budget From (1960 - 2015)");
Popularity helps revenue, but on its own it does not determine whether a movie earns high revenue.
# plot a barplot of the revenue for the top 10 movies in terms of popularity
top_10_movies_popularity = df.groupby('original_title').mean(numeric_only=True).sort_values('popularity',ascending=False).head(10).reset_index()
# plot properties
sns.set_style('whitegrid')
sns.set_context({"figure.figsize": (24, 10)})
sns.set(font_scale=2) # scale of font size of the plot
g = sns.barplot(x='revenue',y='original_title', data=top_10_movies_popularity )
# setting text on bars
for index, row in top_10_movies_popularity.iterrows():
    g.text(row.revenue+199990000,row.name, 'Revenue = '+str(round(row.revenue,2))+', Popularity = '+str(round(row.popularity,2)), color='black', ha="center")
plt.ylabel("")
plt.xlabel("Revenue")
plt.title("Revenue Per Movie For The top 10 Highest Movies In Term of Popularity From (1960 - 2015)");
Among the 10 highest-budget movies, revenue tends to decrease as budget increases, which points to an inverse relationship between revenue and budget in this small sample.
# plot revenue against budget for the top 10 movies with the highest budget
sns.set_style('whitegrid')
sns.set_context({"figure.figsize": (24, 10)})
# select the 10 highest-budget movies once and order them by revenue
top_10_budget_sorted = df.sort_values('budget',ascending=False).head(10).sort_values('revenue',ascending=False)
plt.plot(top_10_budget_sorted['original_title'],top_10_budget_sorted['budget'])
plt.plot(top_10_budget_sorted['original_title'],top_10_budget_sorted['revenue'])
plt.legend(['Budget','Revenue'])
plt.xticks(rotation=90)
plt.ylabel("Revenue/Budget")
plt.xlabel("")
plt.title("Revenue Against Budget for The Top 10 Highest Movies In Terms of Budget From (1960 - 2015)")
plt.show();
# getting the top 10 actors with highest number of movies associated with the dataset
cast_top_10_count = df_cast_expanded[['cast','id']].groupby('cast').count().sort_values('id',ascending=False).head(10)
cast_top_10_count
# calculating the mean of the revenue, budget, popularity and, Rating for the top 10 actors in terms of count of movies
top_10_cast_movie_count = df_cast_expanded[df_cast_expanded['cast'].isin(cast_top_10_count.index)]
top_10_cast_movie_count = top_10_cast_movie_count.groupby('cast').mean(numeric_only=True)
top_10_cast_movie_count
top_10_cast_movie_count.reset_index(inplace=True) # reseting the index of the dataframe
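The selection above works in three steps: explode the cast lists, count movies per actor, then average the numeric columns for the selected actors only. A compact sketch of the same idea on invented data follows; the actor names and numbers are made up, and `value_counts` is used here as an alternative to the `groupby(...).count()` call:

```python
import pandas as pd

# Three invented movies with invented cast lists
toy = pd.DataFrame({
    "id": ["1", "2", "3"],
    "cast": [["Ann", "Bob"], ["Ann", "Cy"], ["Ann", "Bob"]],
    "revenue": [100.0, 200.0, 300.0],
})

# One row per (movie, actor) pair
per_actor = toy.explode("cast")

# Number of movies per actor, most frequent first
movie_counts = per_actor["cast"].value_counts()

# Keep the 2 most frequent actors and average their movies' revenue
top_actors = movie_counts.head(2).index
top_actor_means = (
    per_actor[per_actor["cast"].isin(top_actors)]
    .groupby("cast")[["revenue"]]
    .mean()
    .reset_index()
)

print(movie_counts)
print(top_actor_means)
```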
Here is the list of the top 10 actors/actresses who appeared in the most movies from 1960 to 2015, along with their average revenue.
# the average revenue for top 10 actors in terms of movie counts
# plot properties
sns.set_style('whitegrid')
sns.set_context({"figure.figsize": (20, 10)})
sns.set(font_scale=2) # scale of font size of the plot
g = sns.barplot(x='revenue',y='cast', data=top_10_cast_movie_count)
# setting text on bars
for index, row in top_10_cast_movie_count.iterrows():
    g.text(row.revenue,row.name, round(row.revenue,2), color='black', ha="center")
plt.ylabel("")
plt.xlabel("Average Revenue")
plt.title("Average Revenue Per Actor/Actress's Movies For The top 10 Highest Cast In Term of Number of Movies From (1960 - 2015)");
The budgets of these actors' movies are high, which makes sense: these actors are very popular, they appear in many movies each year, and their movies tend to have high budgets.
# the average budget for top 10 actors in terms of movie counts
# plot properties
sns.set_style('whitegrid')
sns.set_context({"figure.figsize": (20, 10)})
sns.set(font_scale=2) # scale of font size of the plot
g = sns.barplot(x='budget',y='cast', data=top_10_cast_movie_count)
# setting text on bars
for index, row in top_10_cast_movie_count.iterrows():
    g.text(row.budget,row.name, round(row.budget,2), color='black', ha="center")
plt.ylabel("")
plt.xlabel("Budget")
plt.title("Average Budget Per Actor/Actress's Movies For The top 10 Highest Cast In Term of Number of Movies From (1960 - 2015)");
Their popularity is high as well, which also makes sense.
# the average popularity for top 10 actors in terms of movie counts
# plot properties
sns.set_style('whitegrid')
sns.set_context({"figure.figsize": (20, 10)})
sns.set(font_scale=2) # scale of font size of the plot
g = sns.barplot(x='popularity',y='cast', data=top_10_cast_movie_count)
# setting text on bars
for index, row in top_10_cast_movie_count.iterrows():
    g.text(row.popularity,row.name, round(row.popularity,2), color='black', ha="center")
plt.ylabel("")
plt.xlabel("Popularity")
plt.title("Average Popularity Per Actor/Actress's Movies For The top 10 Highest Cast In Term of Number of Movies From (1960 - 2015)");
The average rating of their movies is also high.
# the average rating for top 10 actors in terms of movie counts
# plot properties
sns.set_style('whitegrid')
sns.set_context({"figure.figsize": (20, 10)})
sns.set(font_scale=2) # scale of font size of the plot
g = sns.barplot(x='vote_average',y='cast', data=top_10_cast_movie_count)
# setting text on bars
for index, row in top_10_cast_movie_count.iterrows():
    g.text(row.vote_average,row.name, round(row.vote_average,2), color='black', ha="center")
plt.ylabel("")
plt.xlabel("Rating")
plt.title("Average Rating Per Actor/Actress's Movies For The top 10 Highest Cast In Term of Number of Movies From (1960 - 2015)");
Here are the 10 actors whose movies earned the highest revenue.
# grouping data by each actor/actress
cast_top_10_revenue = df_cast_expanded[['cast','revenue']].groupby('cast').sum().sort_values('revenue',ascending=False).head(10)
cast_top_10_revenue
# calculating the mean of the revenue, budget, popularity and, Rating for the top 10 actors in terms of highest revenues
top_10_cast_movie_revenue = df_cast_expanded[df_cast_expanded['cast'].isin(cast_top_10_revenue.index)]
top_10_cast_movie_revenue = top_10_cast_movie_revenue.groupby('cast').mean(numeric_only=True)
top_10_cast_movie_revenue
top_10_cast_movie_revenue.sort_values('revenue',ascending=False,inplace=True)
top_10_cast_movie_revenue.reset_index(inplace=True)
In this plot we visualize the average revenue for the top 10 actors in terms of revenue.
# plot properties
sns.set_style('whitegrid')
sns.set_context({"figure.figsize": (20, 10)})
sns.set(font_scale=2) # scale of font size of the plot
g = sns.barplot(x='revenue',y='cast', data=top_10_cast_movie_revenue)
# setting text on bars
for index, row in top_10_cast_movie_revenue.iterrows():
    g.text(row.revenue,row.name, round(row.revenue,2), color='black', ha="center")
plt.ylabel("")
plt.xlabel("Average Revenue")
plt.title("Average Revenue Per Actor/Actress's Movies For The top 10 Highest Cast In Term of Revenue From (1960 - 2015)");
In this plot we visualize the average budget for the top 10 actors in terms of revenue.
# Average budget of top 10 Actors in terms of highest revenue
# plot properties
sns.set_style('whitegrid')
sns.set_context({"figure.figsize": (20, 10)})
sns.set(font_scale=2) # scale of font size of the plot
g = sns.barplot(x='budget',y='cast', data=top_10_cast_movie_revenue)
# setting text on bars
for index, row in top_10_cast_movie_revenue.iterrows():
    g.text(row.budget,row.name, round(row.budget,2), color='black', ha="center")
plt.ylabel("")
plt.xlabel("Average Budget")
plt.title("Average Budget Per Actor/Actress's Movies For The top 10 Highest Cast In Term of Revenue From (1960 - 2015)");
In this plot we visualize the average popularity for the top 10 actors in terms of revenue.
# Average popularity of top 10 Actors in terms of highest revenue
# plot properties
sns.set_style('whitegrid')
sns.set_context({"figure.figsize": (20, 10)})
sns.set(font_scale=2) # scale of font size of the plot
g = sns.barplot(x='popularity',y='cast', data=top_10_cast_movie_revenue)
# setting text on bars
for index, row in top_10_cast_movie_revenue.iterrows():
    g.text(row.popularity,row.name, round(row.popularity,2), color='black', ha="center")
plt.ylabel("")
plt.xlabel("Average Popularity")
plt.title("Average Popularity Per Actor/Actress's Movies For The top 10 Highest Cast In Term of Revenue From (1960 - 2015)");
In this plot we visualize the average rating for the top 10 actors in terms of revenue.
# Average Rating of top 10 Actors in terms of highest revenue
# plot properties
sns.set_style('whitegrid')
sns.set_context({"figure.figsize": (20, 10)})
sns.set(font_scale=2) # scale of font size of the plot
g = sns.barplot(x='vote_average',y='cast', data=top_10_cast_movie_revenue)
# setting text on bars
for index, row in top_10_cast_movie_revenue.iterrows():
    g.text(row.vote_average,row.name, round(row.vote_average,2), color='black', ha="center")
plt.ylabel("")
plt.xlabel("Average Rating")
plt.title("Average Rating Per Actor/Actress's Movies For The top 10 Highest Cast In Term of Revenue From (1960 - 2015)");
# grouping dataframe with each actor/actress in terms of highest budget
cast_top_10_budget = df_cast_expanded[['cast','budget']].groupby('cast').sum().sort_values('budget',ascending=False).head(10)
cast_top_10_budget
top_10_cast_movie_budget = df_cast_expanded[df_cast_expanded['cast'].isin(cast_top_10_budget.index)]
top_10_cast_movie_budget = top_10_cast_movie_budget.groupby('cast').mean(numeric_only=True)
top_10_cast_movie_budget
top_10_cast_movie_budget.reset_index(inplace=True)
In this plot we visualize the average revenue for the top 10 actors in terms of budget.
# plot properties
sns.set_style('whitegrid')
sns.set_context({"figure.figsize": (20, 10)})
sns.set(font_scale=2) # scale of font size of the plot
g = sns.barplot(x='revenue',y='cast', data=top_10_cast_movie_budget)
# setting text on bars
for index, row in top_10_cast_movie_budget.iterrows():
    g.text(row.revenue,row.name, round(row.revenue,2), color='black', ha="center")
plt.ylabel("")
plt.xlabel("Average Revenue")
plt.title("Average Revenue Per Actor/Actress's Movies For The top 10 Highest Cast In Term of Budget From (1960 - 2015)");
In this plot we visualize the average budget for the top 10 actors in terms of budget.
# plot properties
sns.set_style('whitegrid')
sns.set_context({"figure.figsize": (20, 10)})
sns.set(font_scale=2) # scale of font size of the plot
g = sns.barplot(x='budget',y='cast', data=top_10_cast_movie_budget)
# setting text on bars
for index, row in top_10_cast_movie_budget.iterrows():
    g.text(row.budget,row.name, round(row.budget,2), color='black', ha="center")
plt.ylabel("")
plt.xlabel("Average Budget")
plt.title("Average Budget Per Actor/Actress's Movies For The top 10 Highest Cast In Term of Budget From (1960 - 2015)");
In this plot we visualize the average rating for the top 10 actors in terms of budget.
# plot properties
sns.set_style('whitegrid')
sns.set_context({"figure.figsize": (20, 10)})
sns.set(font_scale=2) # scale of font size of the plot
g = sns.barplot(x='vote_average',y='cast', data=top_10_cast_movie_budget)
# setting text on bars
for index, row in top_10_cast_movie_budget.iterrows():
    g.text(row.vote_average,row.name, round(row.vote_average,2), color='black', ha="center")
plt.ylabel("")
plt.xlabel("Average Rating")
plt.title("Average Rating Per Actor/Actress's Movies For The top 10 Highest Cast In Term of Budget From (1960 - 2015)");
In this plot we visualize the average popularity for the top 10 actors in terms of budget.
# plot properties
sns.set_style('whitegrid')
sns.set_context({"figure.figsize": (20, 10)})
sns.set(font_scale=2) # scale of font size of the plot
g = sns.barplot(x='popularity',y='cast', data=top_10_cast_movie_budget)
# setting text on bars
for index, row in top_10_cast_movie_budget.iterrows():
    g.text(row.popularity,row.name, round(row.popularity,2), color='black', ha="center")
plt.ylabel("")
plt.xlabel("Average Popularity")
plt.title("Average Popularity Per Actor/Actress's Movies For The top 10 Highest Cast In Term of Budget From (1960 - 2015)");
# group the dataframe by genre and calculate the mean of each numeric column in the group
genres = df_genres_expanded.groupby('genres').mean(numeric_only=True).sort_values('popularity',ascending=False)
genres
genres.reset_index(inplace=True)
The most popular genre is Animation, while the least popular is Foreign.
# plot properties
sns.set_style('whitegrid')
sns.set_context({"figure.figsize": (20, 10)})
sns.set(font_scale=2) # scale of font size of the plot
g = sns.barplot(x='popularity',y='genres', data=genres)
# setting text on bars
for index, row in genres.iterrows():
    g.text(row.popularity,row.name+0.3, round(row.popularity,2), color='black', ha="center")
plt.ylabel("")
plt.xlabel("Average Popularity")
plt.title("Average Popularity For Movies Per Genre From (1960 - 2015)");
The 5 most popular genres are Animation, Adventure, Fantasy, Action, and Science Fiction.
genres = genres.sort_values('budget',ascending=False).reset_index(drop=True)
genres
The highest average budget belongs to the Animation genre, while the lowest belongs to the Foreign genre.
# plot properties
sns.set_style('whitegrid')
sns.set_context({"figure.figsize": (20, 10)})
sns.set(font_scale=2) # scale of font size of the plot
g = sns.barplot(x='budget',y='genres', data=genres)
# setting text on bars
for index, row in genres.iterrows():
    g.text(row.budget+2000500,row.name, round(row.budget,2), color='black', ha="center")
plt.ylabel("")
plt.xlabel("Average Budget")
plt.title("Average Budget For Movies Per Genre From (1960 - 2015)");
The 5 genres with the highest budgets are Adventure, Fantasy, Animation, Action, and Family.
genres = genres.sort_values('revenue',ascending=False).reset_index(drop=True)
genres
The highest average revenue belongs to the Animation genre, while the lowest again belongs to the Foreign genre.
# plot properties
sns.set_style('whitegrid')
sns.set_context({"figure.figsize": (20, 10)})
sns.set(font_scale=2) # scale of font size of the plot
g = sns.barplot(x='revenue',y='genres', data=genres)
# setting text on bars
for index, row in genres.iterrows():
    g.text(row.revenue+10000000,row.name, round(row.revenue,2), color='black', ha="center")
plt.ylabel("")
plt.xlabel("Average Revenue")
plt.title("Average Revenue For Movies Per Genre From (1960 - 2015)");
From these plots, genre looks like a useful factor for estimating revenue when combined with budget and popularity.
We can also see that the same top 5 genres with the highest budgets earn the highest revenue.
# group by production company and keep the 10 companies with the highest average budget
prod_comp_budget = df_production_companies_expanded[['production_companies','budget']].groupby('production_companies').mean().sort_values('budget',ascending=False).head(10)
prod_comp_budget
production_companies_budget = df_production_companies_expanded[df_production_companies_expanded['production_companies'].isin(prod_comp_budget.index)].groupby('production_companies').mean(numeric_only=True).reset_index()
production_companies_budget
production_companies_budget = production_companies_budget.sort_values('budget' ,ascending=False).reset_index(drop=True) # sort dataframe based on budget
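Ranking companies by their mean budget can put a company with a single very expensive movie at the top. One possible refinement, which is not part of the analysis in this report, is to require a minimum number of movies per company before ranking. The sketch below uses invented company names and numbers, and the threshold `min_movies` is a hypothetical choice:

```python
import pandas as pd

# Invented data: one studio with three movies, one with a single expensive movie
toy = pd.DataFrame({
    "production_companies": ["Big Studio"] * 3 + ["One-Off Films"],
    "budget": [100.0, 120.0, 80.0, 500.0],
})

# Mean budget and number of movies per company
stats_per_company = toy.groupby("production_companies")["budget"].agg(["mean", "count"])

min_movies = 2  # hypothetical threshold, chosen only for this illustration
ranked = (
    stats_per_company[stats_per_company["count"] >= min_movies]
    .sort_values("mean", ascending=False)
)

print(stats_per_company)
print(ranked)
```

Showing the count alongside the mean also lets the reader judge how representative each average is.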
In this plot we visualize the average budget of the movies made by each production company, for the top 10 production companies in terms of budget.
# plot properties
sns.set_style('whitegrid')
sns.set_context({"figure.figsize": (20, 10)})
sns.set(font_scale=2) # scale of font size of the plot
g = sns.barplot(x='budget',y='production_companies', data=production_companies_budget)
# setting text on bars
for index, row in production_companies_budget.iterrows():
    g.text(row.budget+30000000,row.name, round(row.budget,2), color='black', ha="center")
plt.ylabel("")
plt.xlabel("Average Budget")
plt.title("Average Budget Per Production Company for Top 10 Production Companies in terms of Budget From (1960 - 2015)");
production_companies_budget = production_companies_budget.sort_values('revenue' ,ascending=False).reset_index(drop=True) # sort data based on revenue
In this plot we visualize the average revenue of the movies made by each production company, for the top 10 production companies in terms of budget.
# plot properties
sns.set_style('whitegrid')
sns.set_context({"figure.figsize": (20, 10)})
sns.set(font_scale=2) # scale of font size of the plot
g = sns.barplot(x='revenue',y='production_companies', data=production_companies_budget)
# setting text on bars
for index, row in production_companies_budget.iterrows():
    g.text(row.revenue+100000000,row.name, round(row.revenue,2), color='black', ha="center")
plt.ylabel("")
plt.xlabel("Average Revenue")
plt.title("Average Revenue Per Production Company for Top 10 Production Companies in terms of Budget From (1960 - 2015)");
# Choose the highest 10 production companies revenue
prod_comp_revenue = df_production_companies_expanded[['production_companies','revenue']].groupby('production_companies').mean().sort_values('revenue',ascending=False).head(10)
prod_comp_revenue
production_companies_revenue = df_production_companies_expanded[df_production_companies_expanded['production_companies'].isin(prod_comp_revenue.index)].groupby('production_companies').mean(numeric_only=True).reset_index()
production_companies_revenue
production_companies_revenue = production_companies_revenue.sort_values('revenue',ascending=False).reset_index(drop=True) # sort data based on revenue
In this plot we visualize the average revenue of the movies made by each production company, for the top 10 production companies in terms of revenue.
# plot properties
sns.set_style('whitegrid')
sns.set_context({"figure.figsize": (20, 10)})
sns.set(font_scale=2) # scale of font size of the plot
g = sns.barplot(x='revenue',y='production_companies', data=production_companies_revenue)
# setting text on bars
for index, row in production_companies_revenue.iterrows():
    g.text(row.revenue+10000000,row.name, round(row.revenue,2), color='black', ha="center")
plt.ylabel("")
plt.xlabel("Average Revenue")
plt.title("Average Revenue Per Production Company for Top 10 Production Companies in terms of Revenue From (1960 - 2015)");
production_companies_revenue = production_companies_revenue.sort_values('budget',ascending=False).reset_index(drop=True) # sort data based on budget
In this plot we visualize the average budget of the movies made by each production company, for the top 10 production companies in terms of revenue.
# plot properties
sns.set_style('whitegrid')
sns.set_context({"figure.figsize": (20, 10)})
sns.set(font_scale=2) # scale of font size of the plot
g = sns.barplot(x='budget',y='production_companies', data=production_companies_revenue)
# setting text on bars
for index, row in production_companies_revenue.iterrows():
    g.text(row.budget+10000000,row.name, round(row.budget,2), color='black', ha="center")
plt.ylabel("")
plt.xlabel("Average Budget")
plt.title("Average Budget Per Production Company for Top 10 Production Companies in terms of Revenue From (1960 - 2015)");
g = sns.PairGrid(df, vars=["popularity", "budget",'revenue','runtime','vote_average'])
g.map(plt.scatter);
plt.plot(df.sort_values('vote_average')['vote_average'],df.sort_values('vote_average')['popularity'],linewidth=3)
plt.xlabel('Rating')
plt.ylabel('Popularity')
plt.title("Relationship between Rating and Popularity For All Movies in the Dataset");
In this plot, popularity tends to increase as the rating increases, which is a plausible relationship.
The matrix plot also shows a strong relationship between budget and revenue, as well as between vote average and popularity.
Runtime is also a useful factor when analyzing movies, since it lets us check hypotheses such as budget increasing with runtime. An interesting finding here is that revenue also increases with runtime, which is surprising but may be real.
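The relationships read off the scatter matrix can be backed with numbers by computing pairwise Pearson correlation coefficients. The sketch below runs on invented values, so its output says nothing about the real dataset; on the cleaned dataframe the same call would be applied to the columns passed to the pair grid.

```python
import pandas as pd

# Invented values for three of the numeric columns
toy = pd.DataFrame({
    "budget": [10.0, 20.0, 30.0, 40.0, 50.0],
    "revenue": [15.0, 45.0, 50.0, 90.0, 100.0],
    "runtime": [95.0, 100.0, 120.0, 110.0, 130.0],
})

# Pairwise Pearson correlation; values range from -1 to 1
corr = toy.corr(method="pearson")

print(corr.round(2))
```

A coefficient close to 1 or -1 indicates a strong linear relationship, while a value near 0 indicates that the scatter plot shows little linear structure.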
What was challenging while working with this dataset:
In the cleaning step, identifying and removing outliers and dealing with missing data. Deciding whether to drop or fill the missing data was very challenging, but in this case the missing values could not be filled with any sensible alternative, so the best decision was to drop them to keep the analysis clear and reliable.
Finding the right questions was challenging for me, since many questions can be asked about this dataset. The important thing is to ask questions that support the analysis, especially since the key hypothesis I chose to investigate is the relationship between budget and revenue, along with the other factors that affect revenue.
Choosing the right visualizations was very challenging, and it took time to find the ones that best support my findings.
Finally, we can summarize our findings in a few points:
Popular movies tend to be rated higher than others, as the visualization showed.
In the latest 5 years covered by the dataset, movie budgets were lower, and this coincided with a decrease in revenue that was clearly visible in the visualizations.
We found that the top 5 genres with the highest budgets also earn the highest revenue.
We also explored the dataset in terms of cast and found that most of the actors whose movies had high budgets also had higher revenue.
One of the interesting findings, as discussed earlier, is the runtime factor and its effect on the growth of revenue and budget.
Finally, we can conclude that this dataset is useful for indicating the factors that can affect the movie industry, such as time, runtime, budget, and actor popularity.
This analysis report is a starting point for exploring this case further. Future work could include using ML to predict the revenue of future movies from past movie data, and gathering more data would make the case study more useful.